Rodney Dyer, PhD
Describe the daytime air temperatures at the Rice Rivers Center for February, 2014 by day of the week.
To do this, we have the following general workflow.
There are a finite number of action verbs that can be used on raw data. They are combined to yield meaningful inferences from our data.
Select: Identify only the subset of data columns that you are interested in using.
Filter: Use only some subset of rows in the data based upon qualities within the columns themselves.
Arrange: Reorder the data using values in one or more columns to sort.
Mutate: Convert one data type to another, scaling, combining, or making any other derivative component.
Summarize: Perform operations on the data to characterize trends in the raw data as summary statistics.
Group: Partition the data set into groups based upon some taxonomy of categorization.
The manner in which we organize these action verbs yields an infinite number of combinations.
The Treachery of Images
In R we use this grammar.
The pipe operator %>% takes the values in data and passes them to the function Y() as if you had entered data as its first argument, so data %>% Y() is equivalent to Y( data ).
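A minimal sketch of the pipe, assuming the dplyr package is installed (it attaches the %>% operator). The vector values is a toy object invented for illustration; it is not part of the Rice data.

```r
library( dplyr )   # attaches the %>% pipe operator

# Toy data, invented for illustration only
values <- c( 1, 4, 9 )

# A conventional call: the data are the first argument
nested <- sqrt( values )

# The same call written with the pipe: values is passed to sqrt()
piped <- values %>% sqrt()

# Pipes can be chained, each result feeding the next function
total <- values %>% sqrt() %>% sum()

identical( nested, piped )
```

Reading left to right, the chained version says "take values, then take the square root, then sum", which is the order in which the operations actually happen.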
The data we will be working with come from the Rice Rivers Center.
library( tidyverse )   # provides read_csv() and the dplyr verbs
url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
names( rice )
[1] "DateTime" "RecordID"
[3] "PAR" "WindSpeed_mph"
[5] "WindDir" "AirTempF"
[7] "RelHumidity" "BP_HG"
[9] "Rain_in" "H2O_TempC"
[11] "SpCond_mScm" "Salinity_ppt"
[13] "PH" "PH_mv"
[15] "Turbidity_ntu" "Chla_ugl"
[17] "BGAPC_CML" "BGAPC_rfu"
[19] "ODO_sat" "ODO_mgl"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
Using the column numbers instead of names.
DateTime PAR WindDir PH
Length:8199 Min. : 0.000 Min. : 0.00 Min. :6.43
Class :character 1st Qu.: 0.000 1st Qu.: 37.31 1st Qu.:7.50
Mode :character Median : 0.046 Median :137.30 Median :7.58
Mean : 241.984 Mean :146.20 Mean :7.60
3rd Qu.: 337.900 3rd Qu.:249.95 3rd Qu.:7.69
Max. :1957.000 Max. :360.00 Max. :9.00
NA's :1
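The call that produced the summary above is not shown. As a hedged sketch of the idea, the example below indexes a small toy tibble (invented values, only three columns) by position and then by name. On the real data, the four summarized columns sit at positions 1, 3, 5, and 13 of the names listing.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented for illustration
toy <- tibble( DateTime = c( "2/1/2014 12:00:00", "2/1/2014 12:15:00" ),
               RecordID = c( 1, 2 ),
               PAR      = c( 350.2, 410.8 ) )

# Columns chosen by position: the reader must look up what 1 and 3 mean
by_number <- toy[ , c( 1, 3 ) ]

# The same columns chosen by name: self-documenting
by_name <- toy[ , c( "DateTime", "PAR" ) ]

# Either form can be piped into summary()
by_number %>% summary()

identical( names( by_number ), names( by_name ) )
```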
Column names are probably better than column numbers
Additional assistance from RStudio via pop-ups.
Longer-term readability (like next Tuesday and beyond!)
RStudio for a data frame in memory
The dplyr library defines a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges using select, filter, arrange, mutate, group_by, and summarise functionality.
It is part of the tidyverse.
In the tidyverse, we can use column names directly and they do not need to be quoted.
The select() function allows you to choose which columns of data to work with.
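A minimal sketch of select(), assuming dplyr is installed. The toy tibble borrows a few column names from the rice data, but its values are invented for illustration.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( DateTime = c( "2/1/2014 12:00:00", "2/1/2014 12:15:00" ),
               AirTempF = c( 38.1, 39.4 ),
               PH       = c( 7.5, 7.6 ),
               Depth_m  = c( 4.2, 4.3 ) )

# Keep only the named columns (no quotes needed)
kept <- toy %>% select( DateTime, AirTempF )

# Drop a column by prefixing its name with a minus sign
dropped <- toy %>% select( -Depth_m )

names( kept )
names( dropped )
```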
There are times when we want to
[1] "DateTime" "RecordID"
[3] "PAR" "WindSpeed_mph"
[5] "WindDir" "AirTempF"
[7] "RelHumidity" "BP_HG"
[9] "Rain_in" "H2O_TempC"
[11] "SpCond_mScm" "Salinity_ppt"
[13] "PH" "PH_mv"
[15] "Turbidity_ntu" "Chla_ugl"
[17] "BGAPC_CML" "BGAPC_rfu"
[19] "ODO_sat" "ODO_mgl"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
Love the everything() function.
[1] "RecordID" "ODO_mgl"
[3] "PH" "DateTime"
[5] "PAR" "WindSpeed_mph"
[7] "WindDir" "AirTempF"
[9] "RelHumidity" "BP_HG"
[11] "Rain_in" "H2O_TempC"
[13] "SpCond_mScm" "Salinity_ppt"
[15] "PH_mv" "Turbidity_ntu"
[17] "Chla_ugl" "BGAPC_CML"
[19] "BGAPC_rfu" "ODO_sat"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
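The reordering shown above can be sketched on a small scale as follows. The toy tibble and its values are invented for illustration; everything() is the selection helper that stands for all columns not already named.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( DateTime = c( "2/1/2014 12:00:00", "2/1/2014 12:15:00" ),
               RecordID = c( 1, 2 ),
               AirTempF = c( 38.1, 39.4 ),
               PH       = c( 7.5, 7.6 ) )

# Move RecordID and PH to the front, then keep the rest in their original order
reordered <- toy %>% select( RecordID, PH, everything() )

names( reordered )
```

No columns are lost in this call; only their order changes.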
The function filter() works to select records (rows) based upon some criteria.
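A minimal sketch of filter(), assuming dplyr is installed. The toy tibble and the thresholds are invented for illustration.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( RecordID = 1:4,
               AirTempF = c( 31.5, 38.1, 44.0, 29.9 ),
               Rain_in  = c( 0, 0.02, 0, 0 ) )

# Keep only the rows where the air temperature is above freezing
above_freezing <- toy %>% filter( AirTempF > 32 )

# Several criteria separated by commas must all be TRUE for a row to be kept
warm_and_dry <- toy %>% filter( AirTempF > 32, Rain_in == 0 )

above_freezing
warm_and_dry
```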
We can sort entire data.frame objects based upon the values in one or more of the columns using the arrange() function.
To reverse the order, negate the column name inside the function (for numeric columns) or wrap it in desc().
You can also sort using several criteria.
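A minimal sketch of the three sorting patterns just described, assuming dplyr is installed. The toy tibble and its values are invented for illustration.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( WindDir  = c( 180, 90, 180, 45 ),
               AirTempF = c( 40.2, 35.1, 33.7, 38.9 ) )

# Ascending order is the default
ascending <- toy %>% arrange( AirTempF )

# Reverse the order by negating a numeric column, or with desc()
descending     <- toy %>% arrange( -AirTempF )
descending_alt <- toy %>% arrange( desc( AirTempF ) )

# Several criteria: sort by WindDir, then break ties by descending AirTempF
by_both <- toy %>% arrange( WindDir, -AirTempF )

by_both
```

Negation only works on numeric columns, whereas desc() also handles characters and factors.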
The mutate() function creates new columns of data.
[1] "POSIXct" "POSIXt"
You can make several mutations in one call, or you can pipe several mutate() calls one after another.
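A minimal sketch of both styles, assuming dplyr and lubridate are installed. The toy tibble, its 24-hour timestamp format, and the AirTempC and Weekday columns are assumptions made for illustration; the real DateTime strings may be formatted differently.

```r
library( dplyr )
library( lubridate )

# Toy stand-in for the rice data; values and timestamp format are invented
toy <- tibble( DateTime = c( "2/1/2014 13:15:00", "2/2/2014 09:30:00" ),
               AirTempF = c( 41, 32 ) )

# Several mutations in one call: parse the text timestamp and convert units
one_call <- toy %>%
  mutate( Date     = mdy_hms( DateTime, tz = "EST" ),
          AirTempC = ( AirTempF - 32 ) * 5 / 9 )

# The same idea as a chain of mutate() calls; later calls can use earlier columns
piped <- toy %>%
  mutate( Date = mdy_hms( DateTime, tz = "EST" ) ) %>%
  mutate( Weekday = wday( Date, label = TRUE ) )

# The new Date column is a proper date-time object, no longer text
class( one_call$Date )
```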
The summarize() function derives inferences from the current data.frame and produces a new one.
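A minimal sketch of summarize(), assuming dplyr is installed. The toy tibble and its values are invented for illustration; the missing water temperature is included to show why na.rm = TRUE can be needed.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( AirTempF  = c( 30, 40, 50 ),
               H2O_TempC = c( 4, NA, 6 ) )

# Collapse all rows into a single row of summary statistics
overall <- toy %>%
  summarize( `Air Temp`   = mean( AirTempF ),
             `Water Temp` = mean( H2O_TempC, na.rm = TRUE ) )

overall
```

The result is a new one-row data frame whose columns are the named summaries, not the original measurements.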
The group_by() function allows you to arbitrarily pull together subsets of data and prepare them to be worked on by something like summarize().
rice %>%
  mutate( Date = mdy_hms( DateTime, tz = "EST" ),
          Month = month( Date, abbr = FALSE, label = TRUE ) ) %>%
  group_by( Month ) %>%
  summarize( `Air Temp` = mean( AirTempF ),
             `Water Temp` = mean( H2O_TempC, na.rm = TRUE ) )
# A tibble: 3 × 3
  Month    `Air Temp` `Water Temp`
  <ord>         <dbl>        <dbl>
1 January        34.7         3.68
2 February       39.7         5.29
3 March          42.6         7.96
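The same verbs can be combined to approach the opening objective of daytime air temperatures in February 2014 by day of the week. The sketch below is one possible arrangement, not a definitive answer. It runs on an invented toy tibble, and the definition of "daytime" as 07:00 up to 19:00 is an assumption made purely for illustration.

```r
library( dplyr )
library( lubridate )

# Toy stand-in for the rice data; values and timestamp format are invented
toy <- tibble( DateTime = c( "1/31/2014 12:00:00",   # January: filtered out
                             "2/1/2014 03:00:00",    # night: filtered out
                             "2/1/2014 12:00:00",
                             "2/1/2014 15:00:00",
                             "2/2/2014 12:00:00" ),
               AirTempF = c( 30, 25, 40, 44, 36 ) )

by_weekday <- toy %>%
  # Mutate: make a real date-time and derive the weekday from it
  mutate( Date    = mdy_hms( DateTime, tz = "EST" ),
          Weekday = wday( Date, label = TRUE, abbr = FALSE ) ) %>%
  # Filter: February 2014, and the assumed daytime window of 07:00 to 18:59
  filter( year( Date ) == 2014, month( Date ) == 2,
          hour( Date ) >= 7, hour( Date ) < 19 ) %>%
  # Group and summarize: one mean temperature per day of the week
  group_by( Weekday ) %>%
  summarize( `Air Temp` = mean( AirTempF ) )

by_weekday
```

Swapping toy for rice should apply the same pipeline to the real data, provided mdy_hms() can parse the real DateTime strings.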
Here are some strategies to consider.
Use head(), summary(), or View() to take a look at what is coming out of your workflow to make sure it resembles what you think it should look like.
There are times when working with data that we may want to use reasonable names.
[1] "DateTime" "RecordID"
[3] "PAR" "WindSpeed_mph"
[5] "WindDir" "AirTempF"
[7] "RelHumidity" "BP_HG"
[9] "Rain_in" "H2O_TempC"
[11] "SpCond_mScm" "Salinity_ppt"
[13] "PH" "PH_mv"
[15] "Turbidity_ntu" "Chla_ugl"
[17] "BGAPC_CML" "BGAPC_rfu"
[19] "ODO_sat" "ODO_mgl"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
If the names are set reasonably in the workflow, then they will be piped directly into tables and figures correctly.
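A minimal sketch of setting readable names inside the workflow, assuming dplyr is installed. The toy tibble and the chosen display names are invented for illustration; backticks allow names that contain spaces.

```r
library( dplyr )

# Toy stand-in for the rice data; values are invented
toy <- tibble( AirTempF  = c( 38.1, 39.4 ),
               H2O_TempC = c( 4.2, 4.4 ) )

# rename() takes new_name = old_name pairs and leaves the data unchanged
renamed <- toy %>%
  rename( `Air Temperature (F)`   = AirTempF,
          `Water Temperature (C)` = H2O_TempC )

names( renamed )
```

Any table or figure built from renamed picks up these labels without further editing.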